Logging, Monitoring and Observability
Logging, monitoring, and observability. This is a huge topic, and each part of it, whether we talk about logging, monitoring, or observability, deserves its own discussion, its own video. But it's also one of those topics where not everything is predecided. There are no hard rules; it's more of a spectrum. These are practices, and most companies, startups, and individuals implement them along a spectrum. We can never say that our product or company follows all the good logging, monitoring, and observability practices; there is no such thing. I just want to put that out there so we don't feel intimidated later on when we see all the different keywords, practices, products, and tools used across the industry.

Now, what are these terms? What is logging? What is monitoring? What is observability? Since this is a huge topic, I don't want to spend hours on definitions, even though we easily could. These are more practices than theory, so I'll also show more code in this video. From the start of this series I've kept a rule that we won't look at code, but this is one of those videos where, without code, it doesn't really make sense. All of these practices are closely tied to code and to how we implement them, so we'll look at how they work in a real application.

Now, coming back to why we need this. On the modern internet, our backend and full-stack applications run in distributed environments: on many servers, across many regions, with users spread across the whole world. Given that, we need practices, tools, and methodologies to keep track of what's happening across all our services and infrastructure. And what do I mean by keep track? There are a lot of parameters we could track.
We can boil them down to some important ones. Logging basically means we keep a record of all the events happening in our application. (These practices, logging, monitoring, and observability, also apply to front-end and web applications, but we'll keep the discussion limited to backends.) Logging is recording all the important events, the suspicious events, the security-related events; pretty much every important event that happens in our backend application, we want to record along with some metadata. Metadata means things like which user ID triggered this request, what the latency of the request was, and which method or function handled it, all the context that will come in handy when we're actually trying to understand our system. So the first component of this whole methodology is logging: recording the events from the whole life cycle of requests and our application's execution.

The second thing is monitoring.
Monitoring is exactly what it sounds like: we want some way of keeping track of the state of our backend application and its components. The server's CPU, the server's memory, how many requests the server is processing per second right now, the state of our database connections, how many connections are open because we're using connection pooling, and so on. Monitoring basically means having real-time data about our system. Not exactly real time: there are tools that can monitor with only milliseconds or seconds of lag, but with traditional tools and practices there is usually some delay, because we don't want to overwhelm the logging, monitoring, and observability systems with data every millisecond. So there's typically a delay of around 10 or 15 seconds. Monitoring means having roughly real-time data, real time meaning 10 to 15 seconds of delay, about the state of our system.

Next is observability. Observability itself includes a lot of other practices, and in theoretical terms it has three pillars.
We like to call them pillars because a backend system can only be called observable if all three components are in place: logs, metrics, and traces. Logs we already know: a record of all the important events across the request life cycle, application startup, shutdown, the whole phase. Metrics closely relate to monitoring; I'll come back to that part a little later. And traces: you can think of traces as transactions. A trace tracks where a request originated, which can be a front-end system, a load balancer, or your backend application itself, and from that point on, every component it touched. Say you have a handler layer, a service layer, a validation layer, a repository layer, and then the database layer: using traces we can track where a request originated and every component it passed through while our backend application was executing it. A trace is basically a transaction including all the components it involves.
We'll come back to more formal definitions and explanations a little later; for now that was a very high-level introduction to logging, monitoring, and observability. This whole practice of observability is pretty modern; it's only been around a few years. A decade back, the primary way of catching errors at the infrastructure level depended mostly on monitoring practices. The problem is that monitoring only tells you that there is a problem. We have plenty of tools and workflows to set up alerts, using Grafana for example (we'll come back to the tools in our arsenal), and with all these monitoring methodologies we'll know there are issues in our application and we'll get alerts. But that's pretty much it: we just get informed that something is wrong. With the newer movement of observability, the system still informs you that something is wrong with your application, but it also tells you exactly what is wrong, provided you've followed the observability practices: you've implemented logs, metrics, and traces.
That's the one-line definition and explanation of observability. Now let's drill down into the individual components.

The first one was logging, and as I said, logging is the practice of recording all the important events across the application's life cycle. For example: a user logs in, we execute a database query, or something fails and we log that error with a timestamp, the user ID, the database query, and all the context we need to debug it. You can think of logs as a journal or diary your backend application maintains, so that when the time comes we know what happened, when it happened, and why.

Similarly, monitoring, as I said, is continuously checking the health and performance of your system, tracking patterns over time, and giving you aggregated data about how your application behaved over time, including right now.

The third one is observability. We call a system observable if we can determine its internal state just by looking at its external outputs.

Now, how do these three work together? Each of them produces one concrete kind of signal that is useful when we're debugging or trying to understand our system. From logs, we learn what exactly happened. From monitoring, we get metrics (we'll talk more about what the term metrics means), which tell us about patterns and trends. And from observability we get traces: as I said, traces are like transactions, and with them we can see how the different components interact. If you've already worked in a production system that implements proper logging, monitoring, and observability, you'll recognize the workflow.
Here's how it works. Let's say we've set up an alert rule: if our error rate goes beyond 80%, we get a message in Slack saying something is wrong with the API service and we should look into it. From there we go to the metrics. Metrics are the different real-time and historical parameters we can track about our system. Some examples: how many requests we've processed, and how many have failed, where failed means any request that returned an error status code (4xx or 5xx). Similarly, if your application is a to-do list, how many to-dos were created and how many of those failed. All these concrete numbers, whether historical or over the last 30 minutes or the last hour, are called metrics, and we predecide them when configuring our whole logging, monitoring, and observability setup: we configure the metrics we care about both in our code and in whatever tool we're using.
So: you get an alert because your error rate went above 80%, delivered to Slack through some kind of webhook. From there you look at the metrics and confirm that the error rate really is above 80%. Then you jump into whatever system you're using: either open-source tools like Grafana and Prometheus, with Jaeger for traces, or a tool like New Relic, which is what we'll see in this video, a kind of one-stop solution if you don't want to go the open-source route and configure all the open-source applications yourself. Whether it's a Grafana dashboard or New Relic, from the metric showing that your error rate is above 80%, the tool will also show you the logs related to that metric, basically all the failed requests. And from the logs you can jump directly into traces: say you see a log with a 500 status; clicking it shows you the trace, meaning for this request, execution started at this function, traveled to that function, then to that one, and failed at this particular point. Using this whole workflow you can find out exactly where things went wrong and debug it quickly. That is the whole benefit of implementing logging, monitoring, and observability in your backend systems. Next up, we'll talk a little more about the individual concepts in logging, monitoring, and observability.
But before that, let me tell you a little about the sponsor of this video, Sevalla. Sevalla is a platform-as-a-service provider; if you've used services like Netlify, Vercel, or Heroku, it's an alternative to those. On Sevalla you can deploy full-stack applications, databases, and custom applications. Since we were just talking about observability: if you want to deploy Grafana, Prometheus, Jaeger, or OpenTelemetry collectors, you can deploy them all by dockerizing them, and connect them to your backend over the internal network if you deploy them in the same region. After the first deployment, you can connect your GitHub repository, and the next time you push to your main branch it gets auto-deployed by a GitHub bot. Sevalla uses Kubernetes under the hood, on GCP with the premium network tier, to completely abstract away the complexity you'd have to deal with if you did these deployments manually on AWS or another cloud provider. You never have to touch YAML files or manage containers.

The platform supports three build options. Nixpacks supports more than 20 languages by default with better resource efficiency than traditional buildpacks. It also supports buildpacks for compatibility with Heroku, so if your infrastructure is already on Heroku, your migration is as smooth as possible. And if you want to go the custom route, it supports deploying from a Dockerfile, giving you complete control over how you configure and deploy your applications.

It also provides a very useful feature if you work in a team: preview deployments. If you've used Vercel before, they offer a similar feature. Preview deployments mean that when a team member raises a PR, you instantly get a domain where you can actually use the application with all the changes from that PR, and then decide whether to merge it or request more changes. This comes in very handy in a team, and even as a one-person team it can quickly make you more productive.

All your applications run on Google's infrastructure with Cloudflare's edge network. Static assets get cached across 260+ points of presence, internal communication between services is free, and bandwidth costs 33% less than Vercel at just $0.10 per GB. If that sounds interesting and cost-effective to you, get started with Sevalla by signing up with the link in the description; you'll get $50 of free credit to try it out and decide for yourself. You'll stop spending hours fighting deployments and actually start shipping features.

Now, coming back to our discussion: logging. What are the things you have to keep in mind? Again, this is an informal talk; we're not going to go through every definition, tool, and component. I'll keep it practical: whatever you need to get started and to get far enough that you can figure the rest out yourself.
The first thing to talk about in logging is levels. Log levels are something you'll see a lot in production systems: when we log an event, and our library supports it (which it should), we assign a level to that log. The most common levels are debug, info, warn, error, and fatal.

Debug: we use the debug level in development, when we're trying to troubleshoot something and need as much detail about the system's behavior as possible. Debug logs can be a bit overwhelming, which is why we only care about them in development and usually disable them in production.

Info: general application operations and business events. If you have a to-do application and a to-do gets created, we log that event at info level. Any general information or successful operation can use the info level.

Warn: events that sit between info and error. Not a successful operation, but not critical enough to record as an error. For example, if authentication fails for a user, we log a warning that the user typed a wrong password; it's not an error, and it's not our fault.

Error: any kind of error you can imagine: validation errors, failed database queries, and so on. This is the most common level, and one of the reasons we use logging in the first place.

Fatal: this one is pretty serious. Once you log something at fatal level, your application mostly stops (and restarts, depending on your infrastructure configuration). Fatal means a very serious issue has occurred and the application is shutting down.

That's all about levels.
Let's get through the theory first; we'll see what it looks like in code afterwards. Next comes structured versus unstructured logs. We usually log in two different ways. The first is console logs: when you're in your development environment, running your backend locally, you want all the logs your application records to show up in your console, wherever you're running, your VS Code terminal or any other terminal. And you want them in a readable, even attractive, format, because that makes it easier to understand, to spot issues, and to fix them. So during development we keep the logs human-readable: plain text, with colors and all.

Structured logging is the other way, and JSON is the most popular format for it. Instead of printing human-readable text, we print each log record as JSON, with all its parameters: the status, the message, and everything else. If we did this in the development environment it would be hard to read and easy to miss issues. But we use JSON (structured) logging in production, because in production all these logs, depending on your configuration, get parsed by some log management system: the ELK stack, or open-source tools like Loki, Promtail, and Grafana (the whole Grafana stack). If the logs are plain text, those tools have a hard time parsing them and hit a lot of errors: it isn't efficient to parse a line of free-form text and extract valuable fields like the user ID or the request ID. That's why in production we log in JSON, so it's easy to parse and easy to extract all the valuable information.

That's structured versus unstructured logging: in development we use unstructured logging to keep things friendly and make errors easy to spot; in production we log in JSON so our log management tools can parse the logs and give us valuable insights.
Since we're mostly talking about practices, I think it's better to show how they actually work instead of just talking theory. Even though theory-only is the whole point of this series, we'll make an exception for this video so that all of this complicated logging, monitoring, and observability is easier to understand. This is the code of a to-do application. The backend is written in Go, the application tries to follow all the logging, monitoring, and observability practices, and we're using a tool called New Relic. If you look up New Relic, it's a complete solution for the logging, monitoring, and observability needs of a system. There's also Grafana, which is really the front-end dashboard part, with the backend usually handled by Prometheus, which builds and stores the metrics. These are the tools used in most enterprises: the Grafana stack, as we like to call it, with Prometheus, Grafana, Promtail, and Jaeger for traces. But if you want a simple integration and a simple workflow, something like New Relic makes more sense. If you don't have the team size or the resources to maintain all those open-source tools, you can definitely go with something like New Relic, and that's exactly what I've done for this application. Configuring all of this from scratch is pretty complicated; if you don't have the experience or the time for it, a proprietary solution makes more sense. So with that context, let me show how we've implemented logging, monitoring, and observability in this system, and along the way we'll understand how the different parts work.
One more note: please don't try to understand or memorize the code, and don't get overwhelmed. The whole point of this video is to show how the practices of logging, monitoring, and observability work; the code is just the vehicle. This is the file that creates a new logger before our application starts, and it has a couple of configurations. The first thing I want to show, because we just talked about it, is log levels. We have a function called getLogLevel that checks whether we're in development mode (running locally) or production mode (deployed on our infrastructure), and depending on that it returns either the debug level or the info level: locally we want more logs for debugging, and in production we want just the informational logs.
Using this function, we configure that filtering. The second thing is the structured versus unstructured logging part. By default we've set the logging format to console, because this is a development environment. But say we change it to JSON and switch the environment from local to production: if we now start the server, you can see we get all the logs in JSON format, which is great for production environments but not very readable. If we change it back to the development setup from the env, this is what the development logs look like, and it's definitely more readable locally: we can clearly see the first log is about connecting to the database, the second says the background job server started, and the third says the server is starting, compared with the earlier JSON format, which is not very easy to read but is definitely easier for other tools to parse.

Now, the second thing we talked about was monitoring. For that to work, if we go to the router, we've set up this New Relic middleware. Inside, it initializes a middleware that wraps our whole app.
Every time a new request comes in, this middleware instruments the whole request. If we're talking about observability, there are two terms you'll hear very frequently: instrumentation and OpenTelemetry. Instrumentation is the practice of actually measuring different attributes of your functions, which is closely tied to what makes a system observable. OpenTelemetry is a standard, a fairly recent addition, that provides a whole ecosystem of tools, SDKs, best practices, and resources for properly instrumenting your application. It doesn't matter what language you're using, Node.js, Go, or Python: the community has built APIs, SDKs, and tools for all the major languages, and it's an open standard. Even though we're using a proprietary tool like New Relic, we could definitely integrate an OpenTelemetry collector to get more control over how we instrument our requests and components. But that's a side note.

Now let's look at one function, createTodo, a service that creates a new to-do, and track exactly what the logging, monitoring, and observability workflow does.
In the first statement after the function starts executing, we pull out the transaction that was put inside the context. We have a middleware called enhanced tracing, and when it's first triggered during the life cycle of a request, it's the first point of contact: there we create a new transaction using the New Relic API and attach parameters like the service name (taken from our configuration), the environment (local or production), the IP address, the user agent, the request ID, the user ID, the user email, the tenant ID, and so on. All the information we'll need in our logs and traces gets added to this transaction, and then we put the transaction into the context. That way, by the time the request reaches this service method, we already have a transaction saved in the context and can use it to proceed. With this workflow, a single request produces a single trace: it starts in the middleware where we create the transaction, then passes through the validation layer, then the service layer, and all of it belongs to one trace. That's how we'll be able to debug our issues.

So in the first statement we take the transaction out of the context, and the next statement means that when this function returns, we end the transaction segment for this particular unit, the to-do create service. Then, since we're in the service layer, we add two more attributes to the transaction: the user ID and the title of the to-do the user requested. Then comes our logging.
As we've discussed, we want to log every important event in our application. So at this point we log that we're performing an operation called create todo, along with that to-do's title. Likewise, if a priority is passed in the request payload, we add that attribute to the transaction (so it's available in the trace) and log it too. Then we log an info-level message saying we're initiating the process of creating a new to-do, and here we do some validation on the parent-child relation. Then we execute the database operation that actually creates the to-do, and if we hit any error at this point, we log that error at error level, add it to our trace through this transaction, and add an attribute recording that the operation was create, so we know this operation resulted in an error. If it succeeds, we record that the operation was create and that we created a to-do with this particular ID. We also log, at debug level, that a to-do was created successfully with this ID; that one won't be visible in our production logs.
Finally, we log one more important event, a business event log for the to-do created operation, with all its associated metadata: the to-do's ID, title, category ID, and priority. That's what the workflow looks like with all the logging and tracing in place; we'll see the monitoring part in a moment in the New Relic dashboard.

Going back, let me switch the log format to JSON so these logs are visible in our New Relic dashboard, and change the environment value to production for the server. This is the New Relic dashboard; if we click on our application, we see data like error rates, transaction time, and so on. Since there was no activity in the last 30 minutes, there's no data to display yet, so let's trigger some requests, say get all to-dos. This is the OpenAPI interface where we can test our APIs. If we fire this API a couple of times without providing any token, we get an unauthorized error, and we want to check whether that error gets logged in our dashboard. Trigger it a couple of times, then go back to Errors in the dashboard, do a refresh, and open the HTTP error type. Here we can see we got a couple of unauthorized errors, along with the related logs. As we discussed, in a proper logging, monitoring, and observability setup all these things work together so that we have a complete understanding of our system and can debug better.

Now, about metrics: if we go to the summary page, these are the metrics.
The average transaction time, the throughput, the error counts: these are metrics, actual numbers we can see to quantify the state of our system. And from the metrics, say the error rate, which is around 80% on average here, we can find the logs. If we click on this log, the unauthorized one, we can see all the information associated with it: the name of our application, the environment it was running in, the error code (unauthorized), the host name, the IP, the level of the error, the message, the HTTP method (GET), the API route we were trying to fetch (/todos), the span ID, the timestamp, and everything else related to this log. That is a log. And it's also connected to a trace; as I said, we have three pillars: logs, metrics, and traces. If we click through, we get the whole trace, which we can dig into for more information.

Similarly, we can take a look at our transactions. In the code, this is the part where we create the transaction and add attributes, and the whole path the transaction took is tracked here; we can see the results of all those transactions in this dashboard. If we click on /todos, we get all the data for that particular transaction: its error rate, its response times, and so on. And if we click on the Go runtime, we get more information about the system where our backend application is actually running: the garbage collection time, the memory usage (just 3 MB here), the throughput, and the average response time.
So, that's what I wanted to show in this video: the whole discussion of logging, monitoring, and observability. These are practices, and we implement them along a spectrum. We can never say a system is completely observable, that it logs and monitors 100% of its parameters; we implement this on a spectrum, and we have different tools to record and measure it all. If we go the open-source route, we have Grafana, Loki, Prometheus, and Jaeger for the traces. If we go with proprietary software so that the whole setup is a little simpler, we have services like New Relic, Datadog, and so on. Using these services, open-source or proprietary, we can see the state of our services, our applications, our infrastructure, the interactions between the different components, and everything else we can imagine. Of course, we have to actually implement it, as we saw in the code, to get those results in the first place. But assuming we've done everything on the code side, we'll have a complete understanding, a complete dashboard view, in whatever software we use; it doesn't matter which.

This whole workflow of logging, monitoring, and observability is a collective effort. You as a developer have to do it at the code level, and the infrastructure people, the DevOps people, have to have the correct setup so they can collect and monitor all the metrics and gather all the logs and traces. So that's all there is to say about logging, monitoring, and observability. There isn't a particular skill you have to learn, but keep in mind that this is a very important part of any production system.